Data visualization is a critical step in the data analysis process, since it is the step that allows us to go from raw data to interpretable insights. It is important not only for communication, where an effective chart is invaluable for conveying complex ideas (e.g., papers, presentations…), but also for us as analysts. Visualization helps analysts explore their data more intuitively, identify biases, and uncover hidden patterns or outliers that might otherwise go unnoticed. In bioinformatics and omics data analysis, where datasets are often vast and complex, this step is absolutely essential.
Data visualization is probably one of the main strengths of R. It offers two main ways to make plots:
R-base graphics, available without installing any other R package
ggplot2, the most popular R package among data analysts because of its flexibility.
I will introduce both of them, but it will be your choice when to use each one.
R-base (without installing any other R package) can produce a variety of plots. Only some of them will be explained here; plotting in general is an art that can only be mastered through practice, so keep at it!
x <- 1:10
y <- 2 * x
plot(
x, y, xlab = "This is the label for the x axis",
ylab = "Label for the y axis",
main = "Random plot",
col = "darkred", cex = 1.5
)
abline(h = 5, lty = 2, col = "darkblue")
abline(v = 6.5, lty = 1, col = "gold")
You can change the color and shape of points using the
col and pch arguments, respectively.
## create an empty canvas on which the following calls can draw
plot(c(1, 21), c(1, 2.3), type = "n", axes = FALSE, ann = FALSE)
## points by color (col)
points(1:20, rep(2, 20), pch = 16, col = rainbow(20))
text(11, 2.2, "col", cex = 1.3)
## points by shape (pch)
points(1:20, rep(1, 20), pch = 1:20)
text(1:20, 1.2, labels = 1:20)
text(11, 1.5, "pch", cex = 1.3)
To understand the parameters and what you need to modify to change
any part of the plot, please read the help page with ?plot.
There is no shortcut. Here are some examples that may serve as a
guide.
set.seed(123)
df.data <- data.frame(
X = rnorm(1000, mean = 1000, sd = 10),
Y = rpois(n = 1000, lambda = 100),
Category.1 = factor(sample(LETTERS[1:3], size = 1000, replace = TRUE)),
Z = rnorm(1000, mean = 12, sd = 6)
)
with(df.data, plot(Y ~ X, col = "darkgreen"))
plot(df.data$X, df.data$Z, col = "darkgreen")
plot(df.data$X, df.data$Y, col = df.data$Category.1)
## why does the line of code below work? Try to understand it
plot(df.data$X, df.data$Y, col = c("blue", "gray", "red")[df.data$Category.1])
plot(
df.data$X,
df.data$Y,
col = c("blue", "gray", "red")[df.data$Category.1], pch = 19
)
# Add a legend
legend(
"topleft",
legend = levels(df.data$Category.1),
col = c("blue", "gray", "red")[factor(levels(df.data$Category.1))],
pch = 19
)
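The color indexing used in the plots above works because a factor is stored internally as integer codes, and a factor placed inside [ ] is used through those codes. A minimal sketch with a small made-up factor (the names categories, my.colors and point.colors are only for illustration):

```r
## a small hypothetical factor with three levels (A, B, C)
categories <- factor(c("B", "A", "C", "A"))
my.colors <- c("blue", "gray", "red")

## each element is stored as the integer position of its level
as.integer(categories)

## indexing with the factor uses those integer codes, so every
## element picks the color at the position of its level
point.colors <- my.colors[categories]
point.colors
```

Because the levels are sorted alphabetically by default, the first color goes to A, the second to B, and the third to C.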
hist(df.data$Y, breaks = 25, col = "pink")
hist(df.data$X, breaks = 50, col = "lightblue")
We can also make several plots in different panels using the
par() function:
par(mfrow = c(1, 2))
hist(df.data$Y, breaks = 25, col = "pink")
hist(df.data$X, breaks = 50, col = "lightblue")
dev.off()
## null device
## 1
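Calling dev.off() closes the graphics device, which also discards the panel layout. If you would rather keep the device open, one possible alternative is to store the previous settings returned by par() and restore them afterwards. The sketch below uses simulated values and arbitrary object names, only for illustration:

```r
## simulated values, only for illustration
set.seed(123)
values.pois <- rpois(n = 1000, lambda = 100)
values.norm <- rnorm(1000, mean = 1000, sd = 10)

## par() returns the previous values of the settings it changes
old.par <- par(mfrow = c(1, 2))
hist(values.pois, breaks = 25, col = "pink")
hist(values.norm, breaks = 50, col = "lightblue")

## restore the original layout (a single panel)
par(old.par)
layout.after <- par("mfrow")
```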
boxplot(Y ~ Category.1, df.data, col = c("blue", "gray", "red"), main = "Boxplot")
with(
df.data,
boxplot(Z ~ Category.1, col = c("blue", "gray", "red"), frame = FALSE)
)
stripchart(
X ~ Category.1, df.data, vertical = TRUE, pch = 17,
col = c("blue", "gray", "red")
)
barplot(
head(df.data$X, 10), names.arg = paste0("Sample-", c(1:10)),
las = 2,
col = c("blue", "gray", "red")[head(df.data$Category.1, 10)],
main = "Barplot of first 10 samples"
)
abline(h = 0)
There are many more options to make plots, make them more appealing, etc. If you’re keen on learning more about R-base plots, I recommend this chapter focused on them: https://intro2r.com/simple-base-r-plots.html.
ggplot2
ggplot2 is an R package meant to generate high-quality
plots. It works a bit differently from R-base, and it is the preferred
option for many users. In my view, once you understand the philosophy
behind it, it is quite intuitive and flexible, allowing you to create very
complex plots just by adding layers.
ggplot2 is intuitive because it uses a grammar of
graphics: it follows a set of rules that, if correctly applied, allow a
user to build complex plots very easily. One of the main differences
between ggplot2 and R-base is that the former is designed
to work only on tables in tidy format. In my view, this is great,
but it forces users to convert their data into that format. We haven’t
talked about this way of organizing data, so this chapter will serve as an
introduction.
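As a rough illustration of what tidy format looks like (one observation per row and one variable per column), the sketch below reshapes a small made-up table that has one column per gene. It uses stack() from R-base; the data frame wide.data and its column names are hypothetical, and dedicated packages such as tidyr offer more flexible tools for the same task.

```r
## hypothetical wide table: one row per sample, one column per gene
wide.data <- data.frame(
  Sample = c("S1", "S2", "S3"),
  GeneA = c(2.1, 3.4, 1.8),
  GeneB = c(5.0, 4.2, 6.3)
)

## stack() puts all values in one column and records the name of
## the original column in a second one
long.data <- stack(wide.data[, c("GeneA", "GeneB")])
names(long.data) <- c("Expression", "Gene")

## add the sample identifier back (repeated once per gene)
long.data$Sample <- rep(wide.data$Sample, times = 2)
long.data
```

In the long table, each row is a single measurement, so a column such as Gene can be mapped directly to a color or a panel.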
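Any ggplot2 plot is composed of three main
components: the data, the aesthetic mappings that link variables to
visual properties, and the geometries that draw them.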
The general way to make plots with ggplot2 is to
construct it part by part by adding layers to an initial ggplot object
that contains the data. These layers will define geometries, compute
summary statistics, change scales, define colors, etc.
We are going to use a classic teaching dataset: the Iris dataset. It comprises measurements of iris flowers from three different species: Setosa, Versicolor, and Virginica. Each sample consists of four features: sepal length, sepal width, petal length, and petal width. Additionally, each sample is labeled with its corresponding species.
library("ggplot2")
library("datasets")
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We have 150 samples (rows) defined by 5 variables: 4 continuous
variables and 1 categorical variable (species). Let’s use it to
illustrate some plots that can be made using ggplot2.
Plotting using ggplot2 always follows the same
structure:
The ggplot() function creates an object. It receives a
data frame with the data to be represented and a set of aesthetic
mappings to use for the plot. It is in the latter where we set the variables
that will be plotted. These mappings are created using the
aes() function. In this step, the canvas and the field where
data will be mapped are created.
## canvas, just a ggplot object
ggplot()
## canvas + the field where variables will be mapped. In this case, we take
## the Sepal.Length and Sepal.Width columns from the iris data frame
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width))
Geometries define how ggplot2 displays the data in the
field we just created. Let’s make a simple scatter plot
using geom_point(). To add geometries, we use the
+ operator.
## canvas + field + a point geometry
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
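Because a plot is an ordinary R object, it can be stored in a variable and extended later. The sketch below builds the same scatter plot step by step; the object names p.base and p.points are arbitrary.

```r
library("ggplot2")

## store the canvas and the mappings in an object
p.base <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width))

## adding a layer returns a new ggplot object; p.base is left unchanged
p.points <- p.base + geom_point()

## printing the object is what actually draws the plot
print(p.points)
```

Working this way makes it easy to reuse a common base and try several geometries on top of it.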
Let’s look at some examples that change colors, include new geometries, etc.
We can color points according to different variables and modify
specific parts of the plot using the theme() function,
which controls the non-data appearance of the plot (titles, fonts, background, etc.).
ggplot(
data = iris,
mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)
) + geom_point() + ggtitle("ggtitle() for the title, theme() for the aspect") +
theme(plot.title = element_text(face = "bold", color = "red"))
We can add additional geometries:
ggplot(
data = iris,
mapping = aes(x = Sepal.Length, y = Sepal.Width)
) +
geom_point(color = "darkgreen") +
geom_smooth(method = 'lm', color = "lightgreen") +
ggtitle("Check ?geom_smooth()") +
theme(plot.title = element_text(face = "bold"))
## `geom_smooth()` using formula = 'y ~ x'
If we group our samples according to the species they belong to,
geom_smooth() will also be grouped.
ggplot(
data = iris,
mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)
) + geom_point() + geom_smooth(method = 'lm') +
ggtitle("Check ?geom_smooth()!!") +
theme(plot.title = element_text(face = "bold"))
## `geom_smooth()` using formula = 'y ~ x'
By using the facet_wrap() function, we can split our
data into different panels:
## one panel per species using facet_wrap()
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() + geom_smooth(method = 'lm') + facet_wrap(~ Species) +
ggtitle("Using facet_wrap()") +
theme(plot.title = element_text(face = "bold"))
## `geom_smooth()` using formula = 'y ~ x'
Importantly, ggplot() is not the only function that can receive
mappings. We can also give an additional mapping to each geometry. This is
especially important when making complex, multi-layered plots.
In most other situations, including the additional
variables in ggplot() will be enough.
For instance, in the following plot we want our points to be black,
but the regression lines to have different colors. Hence, we need to set
the color in geom_smooth() and not in
ggplot(), since this would apply to the whole plot.
ggplot(
data = iris,
mapping = aes(x = Sepal.Length, y = Sepal.Width)
) + geom_point() + geom_smooth(aes(color = Species), method = 'lm') +
ggtitle("Only coloring lm lines") +
theme(plot.title = element_text(face = "bold"))
## `geom_smooth()` using formula = 'y ~ x'
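The opposite case can be sketched in the same way: if the color mapping is given only to geom_point(), the points are colored by species while geom_smooth() fits a single line to all samples. This is just an illustrative variation of the plot above, with an arbitrary object name.

```r
library("ggplot2")

p.single.line <- ggplot(
  data = iris,
  mapping = aes(x = Sepal.Length, y = Sepal.Width)
) +
  ## color is mapped only in this layer
  geom_point(aes(color = Species)) +
  ## no grouping variable reaches this layer, so one line is fitted
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  ggtitle("Only coloring points") +
  theme(plot.title = element_text(face = "bold"))

print(p.single.line)
```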
We can also color the points using a continuous variable and change the shape of the points:
ggplot(
data = iris,
mapping = aes(
x = Sepal.Length, y = Sepal.Width, color = Petal.Length, shape = Species
)
) + geom_point(size = 2.5) + xlab("Sepal length") + ylab("Sepal width")
The greatest part of ggplot2 is that all these rules
apply in the same way for any kind of plot we want to make.
ggplot(data = iris, mapping = aes(x = Petal.Length)) +
geom_histogram() +
ggtitle("Histogram of Petal length") +
theme(plot.title = element_text(face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
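The message printed above is a reminder that 30 bins is only a default. A minimal sketch of setting the bin size explicitly through the binwidth argument of geom_histogram(); the value 0.25 is an arbitrary choice for illustration, not a recommendation.

```r
library("ggplot2")

p.hist <- ggplot(data = iris, mapping = aes(x = Petal.Length)) +
  ## each bar now covers 0.25 units of petal length
  geom_histogram(binwidth = 0.25, color = "white") +
  ggtitle("Histogram of petal length (binwidth = 0.25)") +
  theme(plot.title = element_text(face = "bold"))

print(p.hist)
```

With an explicit binwidth, the message about bins is no longer shown, and it is worth trying a few values to see how the shape of the histogram changes.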
The alpha parameter allows us to change the transparency
of the colors:
ggplot(data = iris, mapping = aes(x = Petal.Length, fill = Species)) +
geom_histogram(alpha = 0.8) +
ggtitle("Histogram of petal length") +
theme(plot.title = element_text(face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can also define our own colors:
ggplot(data = iris, mapping = aes(x = Petal.Length, fill = Species)) +
geom_histogram(alpha = 0.8) +
ggtitle("Histogram of petal length") +
scale_fill_manual(values = c("gold", "lightblue", "orange")) +
theme(plot.title = element_text(face = "bold"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Instead of a histogram, we can create a density plot (both show the distribution of a variable) and change the theme used to make the plot:
ggplot(data = iris, mapping = aes(x = Petal.Length, fill = Species)) +
geom_density(alpha = 0.8) +
ggtitle("Density of petal length") +
labs(y = "Density", x = "Petal length", fill = "Species under study") +
scale_fill_manual(values = c("gold", "lightblue", "orange")) +
theme_classic() + theme(plot.title = element_text(face = "bold"))
theme_classic() is just one of the multiple available
themes.
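Themes are added like any other layer, so the same plot can be restyled by swapping a single line. A small sketch using two other themes shipped with ggplot2, theme_minimal() and theme_bw(); the object names are arbitrary.

```r
library("ggplot2")

p.density <- ggplot(data = iris, mapping = aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.8)

## same plot, two different complete themes
p.minimal <- p.density + theme_minimal()
p.bw <- p.density + theme_bw()

print(p.minimal)
print(p.bw)
```

If theme() is also used to tweak individual elements, it should be added after the complete theme, since a complete theme replaces earlier theme settings.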
ggplot(data = iris, mapping = aes(x = Species, y = Petal.Width, fill = Species)) +
geom_boxplot(alpha = 0.8) +
ggtitle("Boxplot of petal width") +
labs(y = "Petal width", x = "Species under study", fill = "Species") +
scale_fill_manual(values = c("gold", "lightblue", "orange")) +
geom_hline(yintercept = 1, color = "red", linetype = "dashed") +
theme_classic() + theme(plot.title = element_text(face = "bold"))
There are many more geometries that can be used to plot your data. If you have an idea of how to represent a particular dataset, just search for it online. Also, you can explore the following links, where you’ll find more examples: